(57) Abstract

The present invention provides an instruction fetch unit aligner. In one embodiment, an apparatus for an instruction fetch unit aligner 
includes selection logic for selecting a non-power of two size instruction from power of two size instruction data, and control logic for 
controlling the selection logic. 

AN INSTRUCTION FETCH UNIT ALIGNER

TECHNICAL FIELD 

The present invention relates generally to microprocessors, and more particularly, to an instruction fetch 
unit aligner. 

BACKGROUND ART

A microprocessor typically includes a cache memory for storing copies of the most recently used 
memory locations. The cache memory generally is smaller and faster than main memory (e.g., disk). A 
microprocessor also typically includes an instruction prefetch unit that is responsible for prefetching instructions 
for a CPU (Central Processing Unit). In particular, an instruction cache unit is typically organized in a way that 
reduces the amount of time spent transferring instructions having a power of two size into the prefetch unit. For
example, a 256-bit bus (256 bits = 4 x 8 bytes = 32 bytes) connecting the instruction cache unit and the prefetch
unit allows a 32-byte instruction prefetch unit to fetch 32 bytes of instruction data in a single cycle of the 
microprocessor. 

DISCLOSURE OF INVENTION 

The present invention provides an instruction fetch unit aligner. For example, the present invention
provides a cost-effective and high performance apparatus for an instruction fetch unit of a microprocessor that
executes instructions having a non-power of two size.

In one embodiment, an apparatus for an instruction fetch unit aligner includes selection logic of an
instruction aligner that extracts and aligns a non-power of two size instruction (e.g., 5, 10, 15, or 20 bytes of
instruction data) from power of two size instruction data (e.g., 64 bytes of instruction data), and control logic of
the instruction aligner for controlling the selection logic. The selection logic is implemented as multiplexer logic
for selecting the non-power of two size instruction from the power of two size instruction data. The extraction
and alignment of the non-power of two size instruction from the power of two size instruction data is performed
within one clock cycle of the microprocessor. For example, four 2:1 multiplexers that each select 8 bytes of the
power of two size instruction data can be used to select 32 bytes of instruction data from 64 bytes of instruction
data, in which the non-power of two size instruction is within the selected 32 bytes of instruction data. The
multiplexer logic then provides 32:1 mux functionality using eight 4:1 multiplexers and four 8:1 multiplexers for
every 4 bits of the power of two size instruction data. A reorder channel that appropriately reorders the bits
output from the multiplexer logic is also provided.

Other aspects and advantages of the present invention will become apparent from the following detailed
description and accompanying drawings.






WO 00/33180 PCT/US99/28873 



BRIEF DESCRIPTION OF DRAWINGS 

FIG. 1 is a block diagram of a microprocessor that includes an instruction fetch unit in accordance with 
one embodiment of the present invention. 

FIG. 2 shows various formats of instructions having a non-power of two size. 

FIG. 3 is a block diagram of an instruction queue and the instruction fetch unit of FIG. 1 shown in
greater detail in accordance with one embodiment of the present invention.

FIG. 4 is a functional diagram of the instruction cache unit of FIG. 1 connected to the instruction fetch 
unit of FIG. 1 in accordance with one embodiment of the present invention. 

FIG. 5 is a diagram of possible 5-byte instruction positions within a 32-byte wide cache memory. 

FIG. 6 is a functional diagram of the operation of the instruction fetch unit of FIG. 4 shown in greater
detail in accordance with one embodiment of the present invention.

FIG. 7 is a functional diagram of a multi-level implementation of the instruction aligner of FIG. 3 in 
accordance with one embodiment of the present invention. 

FIG. 8 is a block diagram of the line buffers connected to the double word muxes of the instruction fetch
unit of FIG. 6 shown in greater detail in accordance with one embodiment of the present invention.

FIG. 9 is a functional diagram of the operation of the rotate and truncate (RAT) unit of FIG. 6 shown in
greater detail in accordance with one embodiment of the present invention. 

FIG. 10 is a functional diagram of a symbolic implementation of the RAT unit of FIG. 6 in accordance
with one embodiment of the present invention. 

FIG. 11 is a functional diagram of a RAT bit ordering in accordance with one embodiment of the
present invention.

FIG. 12 is a functional diagram of a RAT physical implementation in accordance with one embodiment 
of the present invention. 

FIG. 13 is a functional diagram of an input byte ordering of each four byte group that allows the mux's
select control signals to be shared in accordance with one embodiment of the present invention.

FIG. 14 is a block diagram of the instruction queue of FIG. 3 shown in greater detail in accordance with 
one embodiment of the present invention.

MODES FOR CARRYING OUT THE INVENTION

A typical instruction set architecture (ISA) for a microprocessor specifies instructions having a power of
two size, which can be aligned on a power of two boundary in a conventional cache memory. A typical ISA
includes 32-bit instructions that are a fixed size such as for RISC (Reduced Instruction Set Computer) processors.
The 32-bit instructions are typically aligned on a 32-bit boundary in a conventional instruction cache unit. The
32-bit instructions can be prefetched from the instruction cache unit in one clock cycle using a conventional
32-bit data path between the prefetch unit and the instruction cache unit.

However, new instruction set architectures may include instructions having a non-power of two size. To
efficiently fetch instructions having a non-power of two size, a method in accordance with one embodiment of
the present invention includes fetching at least two sequential cache lines for storage in line buffers of an
instruction fetch unit of a microprocessor, and then efficiently extracting and aligning all the bytes of a
non-power of two size instruction from the line buffers. This approach allows for a standard instruction cache
architecture, which aligns cache lines on a power of two boundary, to be used. This approach also reduces the
data path between the instruction cache and the instruction fetch unit. This approach sustains a fetch rate of at
least one sequential instruction per clock cycle of the microprocessor.

For example, an ISA can require supporting execution of instruction packets such as VLIW (Very Long
Instruction Word) packets that are either 5, 10, 15, or 20 bytes wide. For certain applications such as graphics or
media code, there may predominantly be 20-byte wide VLIW packets. If a 20-byte VLIW packet is executed per
clock cycle (e.g., at a peak execution rate), then to maintain this peak execution rate, the instruction fetch unit
fetches at least 20 bytes per clock cycle from the instruction cache unit.

FIG. 1 is a block diagram of a microprocessor 100 that includes an instruction fetch unit (IFU) 108 in
accordance with one embodiment of the present invention. In particular, microprocessor 100 includes a main
memory 102 connected to a bus 104, an instruction cache unit 106 connected to bus 104, instruction fetch unit
108 connected to instruction cache unit 106, and P1 processor 110 and P2 processor 112 each connected to
instruction fetch unit 108. In one embodiment, only P1 processor 110 is provided (i.e., instead of both P1
processor 110 and P2 processor 112), and P1 processor 110 is connected to instruction fetch unit 108.

In one embodiment, instruction cache unit 106 is a conventional 16-kilobyte dual-ported cache that uses 
a well-known (standard) cache architecture of two-way set associative, 32-byte lines (e.g., in order to minimize 
cost and timing risk). Instruction cache unit 106 returns a new 32-byte cache line to instruction fetch unit 108 
during each clock cycle of microprocessor 100, and thus, instruction cache unit 106 can satisfy an execution rate
of, for example, a 20-byte VLIW packet per clock cycle of microprocessor 100. 

However, the 20-byte VLIW packets may not be aligned on the 32-byte cache line boundaries of
instruction cache unit 106. VLIW packets can start on any byte boundary, and an empirical observation reveals
that a significant number of the VLIW packets start on a first cache line and continue onto a second cache
line of two sequential cache lines. For VLIW packets that span two cache lines, two clock cycles would typically
be needed to fetch the entire VLIW packet before executing the VLIW packet. As a result, the throughput of the
execution pipeline of microprocessor 100 may be reduced to approximately one half, thus resulting in a
significant performance degradation.

Accordingly, instruction fetch unit 108 stores two instruction cache lines fetched from instruction cache
unit 106 to ensure that instruction fetch unit 108 can provide the next VLIW packet, regardless of whether or not
the VLIW packet spans two cache lines, in a single clock cycle. In particular, instruction fetch unit 108
prefetches ahead of execution, predicts branch outcomes, and maintains two sequential cache lines of unexecuted
instructions. For example, a 20-byte VLIW packet is extracted from the two sequential instruction cache lines of
instruction fetch unit 108 and then appropriately aligned, and the extraction and alignment is completed in one
clock cycle (assuming the two sequential cache lines stored in instruction fetch unit 108 represent valid data).
For sequential execution, instruction fetch unit 108 provides at least one VLIW packet per clock cycle, regardless
of whether or not the VLIW packet spans two cache lines in instruction cache unit 106.
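
As a rough illustration of why two buffered cache lines are sufficient, the following Python sketch checks whether a
packet that starts at a given byte address crosses a 32-byte cache line boundary. The function and variable names
are hypothetical and are not taken from the specification; only the 32-byte line size and the 20-byte packet size
come from the text.

```python
LINE_BYTES = 32  # cache line size given in the text

def spans_two_lines(start_addr, packet_bytes):
    """Return True if a packet crosses a 32-byte cache line boundary."""
    first_line = start_addr // LINE_BYTES
    last_line = (start_addr + packet_bytes - 1) // LINE_BYTES
    return last_line != first_line

# Byte offsets within a line at which a 20-byte packet spills into the next line.
crossing_offsets = [off for off in range(LINE_BYTES) if spans_two_lines(off, 20)]
```

Because a packet is at most 20 bytes and a line is 32 bytes, a packet can touch at most two sequential lines, which
is consistent with holding exactly two lines in the instruction fetch unit.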

In one embodiment, instruction cache unit 106 is a shared instruction cache unit for multiple processors
(e.g., P1 processor 110 and P2 processor 112).

A typical instruction fetch unit provides a 4-byte granularity. In contrast, instruction fetch unit 108
can fetch instructions with a 1-byte granularity. Instruction fetch unit 108 extracts and aligns a 5, 10, 15, or
20 byte VLIW packet from 64 bytes of instruction data stored in instruction fetch unit 108 (e.g., an instruction
cache line of an instruction cache unit 106 is 32 bytes). Instruction fetch unit 108 efficiently performs the
align operation as discussed below.

FIG. 2 shows various formats of instructions having a non-power of two size. In particular, instruction
format 202 shows an instruction format for a variable size opcode which includes an 8-bit to 16-bit opcode, a
6-bit to 10-bit destination, a 6-bit to 10-bit source 1, a 6-bit to 10-bit source 2, and a 6-bit to 10-bit source 3.
Format 202 ranges from 32 bits to 56 bits. Instruction format 204 shows a 40-bit instruction format which
includes an 8-bit opcode, an 8-bit destination, an 8-bit source 1, an 8-bit source 2 and an 8-bit source 3.
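
The stated sizes can be checked by summing the field widths. The short Python sketch below does only that
arithmetic; the dictionary names are illustrative and not part of the specification.

```python
# Field widths in bits as (minimum, maximum), as listed for instruction format 202.
FORMAT_202 = {
    "opcode": (8, 16),
    "destination": (6, 10),
    "source1": (6, 10),
    "source2": (6, 10),
    "source3": (6, 10),
}

# Fixed field widths in bits, as listed for instruction format 204.
FORMAT_204 = {"opcode": 8, "destination": 8, "source1": 8, "source2": 8, "source3": 8}

min_bits_202 = sum(lo for lo, _ in FORMAT_202.values())
max_bits_202 = sum(hi for _, hi in FORMAT_202.values())
bits_204 = sum(FORMAT_204.values())
```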

Storing non-power of two size instructions, such as shown in instruction format 204, in a conventional
DRAM (Dynamic Random Access Memory) or other conventional cache memory that includes cache lines of
power of two size (e.g., because of binary addressing) results in non-aligned instructions being stored in the
instruction cache. Thus, one embodiment of the present invention allows for the fetching of non-power of two
size instructions from an instruction cache unit in one clock cycle of the microprocessor. For example, a typical
DRAM has a width of a power of two number of bits (e.g., 32 bytes). Similarly, on-chip memory is typically
organized using power of two boundaries and addressing. Thus, non-power of two instruction sets, such as
shown in the instruction format 204 (i.e., a forty bit or five byte instruction), are not necessarily aligned when
stored in instruction cache unit 106.

FIG. 3 is a block diagram of an instruction queue 302 and instruction fetch unit 108 shown in greater 
detail in accordance with one embodiment of the present invention. Instruction fetch unit 108 is connected to 
instruction cache unit 106 via a conventional 32-byte data path. Instruction fetch unit 108 includes a prefetch 
unit 304. Prefetch unit 304 includes dual in-line buffers 306. Dual in-line buffers 306 are implemented as, for 
example, two 32-byte wide registers. Dual in-line buffers 306 store two sequential lines of instructions fetched 
from instruction cache unit 106. By storing two sequential lines of instructions fetched from instruction cache 
unit 106, instruction fetch unit 108 essentially ensures that the subsequent instruction is stored in dual in-line 
buffers 306, regardless of whether or not it represents a non-aligned instruction (e.g., the instruction spans two 
lines in instruction cache unit 106). Thus, instruction fetch unit 108 solves the problem of having to request two 
instruction fetches from instruction cache unit 106, which typically causes a waste of at least one clock cycle of 
the microprocessor. 

Instruction fetch unit 108 also includes an instruction aligner 308. Instruction aligner 308 extracts and 
aligns the non-power of two size instruction from instruction data stored in dual in-line buffers 306. For 
example, for a 40-bit instruction, instruction aligner 308 extracts the 40-bit instruction from the 64 bytes of data 
stored in dual in-line buffers 306. Instruction aligner 308 then efficiently aligns the 40-bit instruction, as further 
discussed below. 

In one embodiment, microprocessor 100 includes four processors or CPUs (Central Processing Units). 
Microprocessor 100 executes up to four instructions per cycle. Instruction fetch unit 108 provides up to four 
instructions per cycle to instruction queue 302 to maintain the peak execution rate of four instructions per cycle. 
For example, for a 40-bit instruction set, which defines 40-bit instruction sizes, instruction fetch unit 108 
provides up to 160 bits per cycle in order to provide four instructions per cycle. Thus, instruction fetch unit 108 
provides up to 20 bytes of instruction data (e.g., a 20-byte VLIW packet) to instruction queue 302 per cycle.
Because dual in-line buffers 306 store 64 bytes of instruction data, instruction aligner 308 is responsible for 
extracting and appropriately aligning, for example, the 20 bytes of instruction data for the next cycle that is 
within the 64 bytes of instruction data stored in dual in-line buffers 306. Accordingly, in one embodiment, an 
efficient method for fetching instructions having a non-power of two size is provided. 

FIG. 4 is a functional diagram of instruction cache unit 106 connected to instruction fetch unit 108 in 
accordance with one embodiment of the present invention. A cache line 402 that includes 32 bytes of instruction 
data stored in instruction cache unit 106 is sent to instruction fetch unit 108 via a 32-byte data path 404. 
Instruction fetch unit 108 includes dual in-line buffers 306. Dual in-line buffers 306 include a line buffer 0 that 
is 32-bytes wide and a line buffer 1 that is 32-bytes wide. For example, line buffer 0 and line buffer 1 can be 
implemented as registers of instruction fetch unit 108, or line buffer 0 and line buffer 1 of dual in-line buffers
306 can be implemented as two sets of enable-reset flip-flops, in which the flip-flops can be stacked (two in one
bit slice). The 32 bytes of data are then extracted from dual in-line buffers 306 and transmitted via a 32-byte
data path 406 to instruction aligner 308. Instruction aligner 308 extracts and aligns the instruction (e.g., 10 bytes
of instruction data) from the 32 bytes of instruction data and then transmits the extracted and aligned instruction
for appropriate execution on processors 110 and 112 of microprocessor 100.

Dual in-line buffers 306 maintain two sequential lines of instruction data fetched from instruction cache
unit 106. After the instruction data is extracted from dual in-line buffers 306, instruction fetch unit 108 fetches
the next sequential line of instruction data for storage in dual in-line buffers 306. For example, based on the
address of the fetched data (e.g., if the fifth address bit is zero, then the fetched data is loaded into line buffer 0,
else the fetched data is loaded into line buffer 1), either line buffer 0 or line buffer 1 is purged, and the next
sequential line of cache memory (e.g., cache line 402 of instruction cache unit 106) is fetched and stored in the
now purged line buffer 0 or line buffer 1. In steady state mode, instruction fetch unit 108 maintains a rate of
fetching of 32 bytes of instruction data per cycle. Because only up to 20 bytes of instruction data are consumed
per cycle in the 20-byte VLIW packet example, and instruction data is stored in memory sequentially, instruction
fetch unit 108 can generally satisfy the peak execution rate of microprocessor 100, such as 20 bytes of instruction
data or four instructions per multi-processor cycle of microprocessor 100.
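
The line buffer selection described above can be modeled in a few lines of Python. This is a minimal sketch and
not the disclosed circuit: the class and function names are hypothetical, and the "fifth address bit" is assumed
here to be the bit with weight 32 (bit 5 when counting from 0), an interpretation suggested by the 32-byte line
size.

```python
LINE_BYTES = 32

def line_buffer_index(fetch_addr):
    """Pick line buffer 0 or 1 from the address bit with weight 32 (ASSUMED to be the 'fifth address bit')."""
    return (fetch_addr >> 5) & 1

class DualLineBuffers:
    """Toy model: two 32-byte registers, each tagged with the address of the line it holds."""

    def __init__(self):
        self.data = [None, None]
        self.line_addr = [None, None]

    def load(self, fetch_addr, line):
        assert len(line) == LINE_BYTES
        idx = line_buffer_index(fetch_addr)
        # Overwriting the register models purging the old line and storing the new one.
        self.data[idx] = bytes(line)
        self.line_addr[idx] = fetch_addr - (fetch_addr % LINE_BYTES)
        return idx
```

Under this assumption, sequential line fetches alternate between the two buffers, so the buffers always hold two
sequential lines once both have been loaded.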

The instruction data path within instruction fetch unit 108 involves, for example, selecting a 20-byte 
wide byte-aligned field from 64 bytes of data stored in dual in-line buffers 306. The 20-byte wide byte-aligned 
field is buffered (e.g., stored in instruction queue 302) and then appropriately presented to the CPUs (e.g., 4 
different processors). For a 20-byte VLIW packet, the data path size between instruction cache unit 106 and
instruction fetch unit 108 can be 32 bytes, because the cache line size is 32 bytes. 

However, efficiently extracting a 20-byte wide byte-aligned field from 64 bytes of non-aligned instruction
data is a challenging problem. Accordingly, instruction fetch unit 108 efficiently performs a rotate
and truncate (RAT) of a 20-byte wide byte-aligned field from 64 bytes of non-aligned instruction data, in which,
for example, 20 bytes is the maximum size of a VLIW packet, and 64 bytes of instruction data is prefetched from
instruction cache unit 106 in accordance with one embodiment of the present invention, as further discussed
below.

FIG. 5 is a diagram of possible 5-byte instruction positions within a 32-byte wide cache memory. Each
32-byte aligned location is called a cache memory line. An instruction can be located in 32 unique positions in
the five cache memory lines (e.g., cache memory lines 0-5) before the position sequence of FIG. 5 repeats.
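
The count of 32 positions can be reproduced by stepping through consecutive 5-byte instructions and recording
the offset of each within its 32-byte line. The Python sketch below assumes, for illustration only, that the
first instruction starts at address 0; the variable names are not from the specification.

```python
LINE_BYTES = 32
INSTR_BYTES = 5

offsets = []   # start offset of each instruction within its cache memory line
addr = 0       # ASSUMPTION: the first instruction starts at address 0
while (addr % LINE_BYTES) not in offsets:
    offsets.append(addr % LINE_BYTES)
    addr += INSTR_BYTES

lines_before_repeat = addr // LINE_BYTES
```

The sequence repeats after 160 bytes, the least common multiple of 5 and 32, which corresponds to 32
instructions spread over five 32-byte lines.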

In one embodiment, instruction aligner 308 can select an instruction from any one of these 32 different 
positions along with 0-3 subsequent instructions (assuming a VLIW packet that includes up to four instructions). 
In order to accomplish this task, instruction aligner 308 uses a 5-bit offset pointer indicating where in the 32-byte
data path the first byte of the General Functional Unit (GFU) instruction is found for a multiprocessor that
includes, for example, four different processors such as the GFU and three Media Functional Units (MFUs). 
Instruction aligner 308 then left justifies the first byte along with up to 19 subsequent bytes to provide the 
instruction packet (e.g., the VLIW packet). If the instruction packet spans (i.e., crosses) a cache memory line 
boundary, then instruction aligner 308 combines the contents of line buffer 0 and line buffer 1 of dual in-line 
buffers 306. 

FIG. 6 is a functional diagram of the operation of instruction fetch unit 108 of FIG. 4 shown in greater 
detail in accordance with one embodiment of the present invention. Each quarter of a line buffer (e.g., line buffer 
0 and line buffer 1 of dual in-line buffers 306) includes 8 bytes, or two words, which together represent a double 
word. Thus, each line buffer includes four double words, which together make up an octword. The first double 
word in the line buffer is numbered 0, followed by 1, 2, and 3, respectively. Line buffer 0 (e.g., line buffer 0 of 
dual in-line buffers 306) holds even octwords, because it includes memory lines at even octword addresses (e.g., 
0, 64, and 128). Line buffer 1 (e.g., line buffer 1 of dual in-line buffers 306) holds odd octwords, because it 
includes memory lines at odd octword addresses (e.g., 32, 96, and 160). Instruction fetch unit 108 includes four 
2:1 doubleword muxes to concatenate any four doublewords stored in line buffer 0 and line buffer 1. Four 
doublewords (32 bytes) provide the data needed to extract a 20-byte instruction such as a VLIW packet, which is 
a maximum of 20 bytes (assuming both of the line buffers include valid data). 
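
One way to picture the doubleword muxes is as four independent 2:1 choices, one per doubleword position. The
Python sketch below is a behavioral model under stated assumptions, not the disclosed logic: the function names
are hypothetical, and the select rule (positions at or after the starting doubleword come from the buffer holding
the packet start, earlier positions come from the other buffer) is inferred from the description rather than
quoted from it.

```python
DW_BYTES = 8  # one doubleword, a quarter of a 32-byte line buffer

def doubleword_selects(start_buf, start_off):
    """Return, for doubleword positions 0..3, which line buffer (0 or 1) each 2:1 mux selects."""
    first_dw = start_off // DW_BYTES
    return [start_buf if k >= first_dw else 1 - start_buf for k in range(4)]

def aligner_input(line_bufs, start_buf, start_off):
    """Concatenate the four selected doublewords into a 32-byte aligner input."""
    sels = doubleword_selects(start_buf, start_off)
    return b"".join(
        line_bufs[s][k * DW_BYTES:(k + 1) * DW_BYTES] for k, s in enumerate(sels)
    )

# Example: a packet starting at byte 15 of line buffer 0.
line_bufs = [bytes(range(0, 32)), bytes(range(32, 64))]
example_selects = doubleword_selects(0, 15)
example_input = aligner_input(line_bufs, 0, 15)
```

Note that the doublewords keep their positions in this model; moving the first packet byte to the leftmost
position is left to the rotate step that follows.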

In one embodiment, the instruction data path is implemented as an instruction data path megacell that
includes the following: dual in-line buffers 306 that hold two cache lines (64 bytes in total) fetched from
instruction cache unit 106, doubleword muxes 602, 604, 606, and 608 that select 32 bytes of instruction data
from dual in-line buffers 306 to provide aligner input 610, and rotate and truncate logic (RAT) unit 611 of
instruction aligner 308 that selects a VLIW packet by left justifying and truncating the 32 bytes presented by the
doubleword muxes to provide RAT output 612.

Specifically, FIG. 6 shows an example of a four instruction VLIW packet starting at byte 15 of line 
buffer 0 of dual in-line buffers 306 and ending at byte 2 of line buffer 1 of dual in-line buffers 306. The VLIW 
packet passes through mux input 0 of doubleword muxes 1 (604), 2 (606), and 3 (608), and mux input 1 of 
doubleword mux 0 (602). The result is a 32-byte aligner input 610 that includes instructions 3, 4, 5, and 6, which 
represent a VLIW packet. Doubleword muxes 602, 604, 606, and 608 represent the first level of muxes that 
select all the doublewords necessary to obtain the minimal power of two size aligned super set of the desired 
VLIW packet (e.g., selects 32 bytes of instruction data that include the 20-byte VLIW packet). Aligner input 610 
is provided to RAT unit 611 of instruction aligner 308. RAT unit 611 performs a RAT function that extracts and
aligns the 20-byte VLIW packet from 32-byte aligner input 610 and, in particular, rotates and truncates the 32
bytes of instruction data in order to output 20 bytes of instruction data as RAT output 612 that represents a
byte-aligned VLIW packet.
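
Functionally, the rotate and truncate step amounts to a left rotation of the 32-byte aligner input by the start
offset followed by keeping the leading 20 bytes. The following Python sketch models only that behavior; the
function name and the example data are illustrative assumptions.

```python
def rotate_and_truncate(aligner_in, offset, out_bytes=20):
    """Left-rotate a 32-byte aligner input by `offset` bytes, then keep the first `out_bytes` bytes."""
    assert len(aligner_in) == 32 and 0 <= offset < 32
    rotated = aligner_in[offset:] + aligner_in[:offset]
    return rotated[:out_bytes]

# Example data (hypothetical): memory bytes are numbered 0..63, the packet starts at byte 15 of
# line buffer 0, and doubleword position 0 has been taken from line buffer 1 (bytes 32..39).
example_input = bytes(range(32, 40)) + bytes(range(8, 32))
example_output = rotate_and_truncate(example_input, 15)
```

In this example the output is memory bytes 15 through 34 in order, that is, a packet that begins in one line and
ends at byte 2 of the next.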

Referring to the selection of bytes of instruction data stored in dual in-line buffers 306, the selection is
performed by using the known start address of the VLIW packet, and then extracting the next sequential bytes
using doubleword muxes 602, 604, 606, and 608 to provide 32-byte aligner input 610. For example, a VLIW 
packet can be 5, 10, 15, or 20 bytes (e.g., it depends on whether or not the compiler generated 1, 2, 3, or 4 
instructions in parallel, that is, for execution in a single cycle on the multi-processor), in which the first two bits 
of the VLIW packet represent a packet header that indicates how many instructions are included in the VLIW 
packet. Thus, when a VLIW packet is decoded, it can be determined that only 10 bytes of instruction data are 
needed (e.g., two instructions were compiled for execution in parallel in a particular cycle). 
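
The header decode can be sketched as follows. The text states only that the first two bits indicate the number of
instructions; the bit position of the header within the first byte and the encoding used below (header value plus
one) are assumptions made purely for illustration, as is the function name.

```python
INSTR_BYTES = 5  # each instruction in the packet is 5 bytes (40 bits)

def packet_length_bytes(first_byte):
    """Return the packet length in bytes from a 2-bit header.

    ASSUMPTIONS (not stated in the source): the header is held in the two most significant
    bits of the first byte, and a header value h means h + 1 instructions.
    """
    header = (first_byte >> 6) & 0b11
    return (header + 1) * INSTR_BYTES
```

With any such encoding, a 2-bit field distinguishes exactly the four packet sizes of 5, 10, 15, and 20 bytes.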

Aligner input 610 represents 32 bytes of instruction data within which resides up to 20 bytes of
non-aligned VLIW data. RAT unit 611 performs a RAT operation that extracts and aligns non-power of two size
instruction data from the power of two size instruction data (e.g., a 20-byte VLIW packet from 32 bytes of
aligner input 610) to provide RAT output 612. The RAT operation can be implemented using twenty 32:1 muxes,
each built from two levels of muxes (eight 4:1 muxes, each of which connects to an 8:1 mux to effectively provide
a 32:1 mux), which represents a brute force approach. However, a more efficient approach is discussed below.

FIG. 7 is a functional diagram of a multi-level implementation of instruction aligner 308 in accordance 
with one embodiment of the present invention. In particular, instruction aligner 308 is implemented using two 
levels of muxes, which includes a first level mux select 802 and a second level mux select 804. The first level of 
muxes includes eight 4:1 byte-wide muxes. The second level of muxes includes an 8:1 byte-wide mux. 
Logically, there is a two-level mux structure for each bit of the 20 bytes input to instruction aligner 308. Mux 
select controls 802 and 804 are updated every cycle in order to sustain alignment of one VLIW packet per cycle. 
For example, instruction aligner 308 can be implemented as a megacell that is organized with a stacked bit cell 
placement. 
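
The two-level structure can be modeled behaviorally as eight 4:1 selections feeding a single 8:1 selection. In the
Python sketch below, the function names are hypothetical, and the way the 5-bit select is split between the two
levels and the grouping of inputs into consecutive sets of four are assumptions chosen for simplicity; the
specification does not fix them at this point.

```python
def mux4(inputs, sel):
    """4:1 mux: pass one of four inputs."""
    assert len(inputs) == 4
    return inputs[sel]

def mux8(inputs, sel):
    """8:1 mux: pass one of eight inputs."""
    assert len(inputs) == 8
    return inputs[sel]

def mux32(inputs, sel):
    """32:1 selection built from eight 4:1 muxes feeding one 8:1 mux."""
    assert len(inputs) == 32 and 0 <= sel < 32
    # ASSUMPTION: the low 2 select bits drive every first-level mux, the high 3 bits drive level two.
    low, high = sel & 0b11, sel >> 2
    level1 = [mux4(inputs[4 * m:4 * m + 4], low) for m in range(8)]
    return mux8(level1, high)
```

All eight first-level muxes share one 2-bit select in this model, so only the second-level select distinguishes
which group of four inputs reaches the output.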

FIG. 8 is a block diagram of dual in-line buffers 306 connected to double word muxes 602, 604, 606, 
and 608 shown in greater detail in accordance with one embodiment of the present invention. Doubleword 
muxes 602, 604, 606, and 608 select 32 bytes out of the 64 bytes stored in dual in-line buffers 306, which include 
line buffer 0 (32 bytes) and line buffer 1 (32 bytes). The 32 bytes of data selected by doubleword muxes 602, 
604, 606, and 608 are then transmitted to RAT unit 611 of the instruction data path as discussed above with
respect to FIG. 6. Doubleword muxes 602, 604, 606, and 608 are essentially 2:1 muxes that select a doubleword 
(8 bytes) from either line buffer 0 (even octword) or line buffer 1 (odd octword). Doubleword muxes 602, 604, 
606, and 608 are used to take advantage of the fact that at most 20 bytes of the 32 bytes of instruction data will 
be used. The granularity of the muxes may be set to any size down to single-byte granularity. The doubleword 
granularity is chosen based upon simplification of truth tables as shown in Tables 1 and 2 in accordance with one 
embodiment of the present invention. 

FIG. 9 is a functional diagram of the operation of RAT unit 611 shown in greater detail in accordance
with one embodiment of the present invention. In particular, RAT unit 611 includes a RAT megacell 702. RAT
megacell 702 performs the functionality of twenty 32:1 byte-wide muxes. The inputs to each of the 32:1 muxes
come from the outputs of doubleword muxes 602, 604, 606, and 608. The inputs to doubleword muxes 602, 604,
606, and 608 come from line buffer 0 and line buffer 1 of dual in-line buffers 306. The byte positions in the line
buffers 0 and 1 are labeled [0A...0Z, 0a...0f] for line buffer 0 and [1A...1Z, 1a...1f] for line buffer 1. The inputs
to each consecutive 32:1 mux in RAT megacell 702 are identical to the previous mux, except the ordering is
rotated to the left by one byte. Accordingly, this can simplify the RAT megacell implementation as follows:
inputs to each mux can be routed identically, and the 32-byte mux select bus can be rotated one position for each
mux, for example, mux #0 (704), mux #1 (706), and mux #19 (708). If the correct double words are provided to RAT
megacell 702, then only one set of decode logic is needed to specify the shift amount. The rotated and truncated
output from RAT unit 611 is transmitted to instruction queue 302.
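
The idea of routing every mux identically and rotating the select bus can be sketched with a one-hot select that
is decoded once from the offset and then shifted by one position per mux. This Python model is an illustration
under assumptions: the one-hot encoding, the helper names, and the direction in which the bus is indexed are not
specified by the text.

```python
def one_hot(offset, width=32):
    """Decode the shift amount once into a one-hot select bus."""
    return [1 if k == offset else 0 for k in range(width)]

def rotate_bus(bus, n):
    """Rotate the bus so that a 1 at position p moves to position (p + n) % len(bus)."""
    n %= len(bus)
    return bus[-n:] + bus[:-n] if n else list(bus)

def mux_one_hot(inputs, select_bus):
    """Pass the single input whose select line is asserted."""
    return next(x for x, s in zip(inputs, select_bus) if s)

def rat_megacell(aligner_in, offset, n_muxes=20):
    """Twenty identically wired 32:1 muxes; mux i sees the base select bus rotated by i positions."""
    base = one_hot(offset)  # the only decode of the shift amount
    return bytes(mux_one_hot(aligner_in, rotate_bus(base, i)) for i in range(n_muxes))
```

In this model, mux i passes input byte (offset + i) mod 32, which gives the same result as rotating the data left
by the offset and truncating to 20 bytes.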

FIG. 10 is a functional diagram of a symbolic implementation of RAT unit 611 in accordance with one
embodiment of the present invention. RAT unit 611 receives the 32 bytes of instruction data presented by
doubleword muxes 602, 604, 606, and 608 and performs a rotation to left justify the byte at the address offset.
RAT unit 611 then truncates the instruction data to provide, for example, a 20-byte VLIW packet. Thus, RAT
unit 611 essentially implements the functionality of a 32:1 mux. The primary function is to map any one of 32
bytes to, for example, each one of the 20 bytes in a 20-byte VLIW packet. Because a 32:1 mux is expensive
from a floor planning and circuit implementation standpoint, RAT unit 611 is implemented as a two-level 32:1
mux in accordance with one embodiment of the present invention. A first level 1002 includes eight 4:1 muxes
for every bit of the aligner input. A second level 1004 includes one 8:1 mux for every bit of the aligner input.

However, by recognizing that all the inputs are the same for bit n of each byte of bytes 0-19 (assuming a
20-byte VLIW packet), some combining of bits is possible to reduce wiring in the RAT implementation.
Accordingly, in one embodiment, the muxes for bit n for 4 bytes are grouped together. The bit ordering of the
first few bits is discussed below with respect to FIG. 11. Because the bits of the VLIW packet are produced out
of order, an additional routing channel is used to "re-order" the bits. The grouping size of 4 bytes means that the
channel must be wide enough to re-order 32 bits (e.g., a routing overhead of approximately 50-60 um). In each
4-byte wide grouping (bit n for 4 bytes), two levels of muxes can be used to implement the 32:1 mux for each bit.
Ordering of the inputs to the eight 4:1 muxes in generation of the selects (select or control signals) allows the
same eight 4:1 muxes to be used for each bit. Thus, eight 4:1 muxes and four 8:1 muxes are used for every 4
bits, instead of eight 4:1 and one 8:1 mux for every bit, which results in a reduction of muxes from 1440 (9x160)
to 480 (12x40).
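
The mux counts quoted above follow directly from the packet width. The Python lines below repeat that
arithmetic; the variable names are illustrative only.

```python
PACKET_BITS = 20 * 8  # a 20-byte VLIW packet is 160 bits

# Brute force: eight 4:1 muxes plus one 8:1 mux for every output bit.
muxes_per_bit_design = (8 + 1) * PACKET_BITS

# Grouped: eight 4:1 muxes plus four 8:1 muxes shared by every group of 4 output bits.
muxes_grouped_design = (8 + 4) * (PACKET_BITS // 4)
```

The grouped arrangement therefore uses one third as many muxes as the per-bit arrangement.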

FIG. 11 is a functional diagram of a RAT bit ordering in accordance with one embodiment of the
present invention. Because 32 inputs span across 4 bits instead of 1, the bit slice pitch can be reduced by using
the RAT bit ordering as shown in FIG. 11. For example, a 14.34 um (microns) pitch can be used instead of a 25
um pitch, which translates into a savings of about 750 um in the width of the instruction data path.

FIG. 12 is a functional diagram of a RAT physical implementation in accordance with one embodiment
of the present invention. The input to the RAT physical implementation of FIG. 12 is the same or identical for
each of the eight 4:1 muxes. By recognizing that each of the inputs to the eight 4:1 muxes is the same (i.e., the
same 32 bytes of data), each of the eight 4:1 muxes can be implemented as shown in a block 1204. Block 1204
shows eight 4:1 muxes (A, B, C, D, E, F, G, and H) and four 8:1 muxes (0, 1, 2, and 3). Each of the eight 4:1
muxes is set or controlled in order to output a particular bit n of each selected byte. For example, block 1204
outputs bit 7 of a 4-byte group 1202, and thus, block 1204 outputs bit 7 of bytes 0, 1, 2, and 3, which represents
an output of bit 159, bit 151, bit 143, and bit 135. The output is then sent to a channel 1206, which reorders the
bits into descending order. For example, assuming 20 bytes of instruction data, such as a 20-byte VLIW packet,
channel 1206 reorders the 160 bits or 20 bytes of data from bit number 159 in descending order to bit number 0.
Because not all of the outputs of the eight 4:1 muxes are necessarily selected, "do not care" conditions can be
provided in the mux selection or control logic. Thus, this embodiment enables some combination of the 4:1 mux
selects. A truth table for the mux select control signals is shown in Tables 3-6 in accordance with one
embodiment of the present invention. Further, the controls of the muxes are generated based upon the offset in
the address offset register (not shown). The controls for each 4:1 mux (muxes A, B, C, D, E, F, G, and H) and
each 8:1 mux (0, 1, 2, and 3) can be shared across the entire RAT unit if the bits are ordered carefully.
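The bit numbers in the example above (bit 7 of bytes 0 through 3 giving bits 159, 151, 143, and 135) can be reproduced with a small Python sketch. It assumes that byte 0 is the most significant byte of the 160-bit output, which is consistent with the example but not stated as a rule in the text. The function names and the order in which the slices are emitted are hypothetical.

```python
PACKET_BYTES = 20  # a 20-byte VLIW packet is 160 bits

def output_bit_number(byte_index: int, bit_in_byte: int) -> int:
    """Output bit number of one bit of one byte (assumption: byte 0 is most significant)."""
    return 8 * (PACKET_BYTES - 1 - byte_index) + bit_in_byte

def slice_bits(group: int, bit_in_byte: int) -> list[int]:
    """Bits produced by one block such as block 1204: the same bit of each byte in a 4-byte group."""
    return [output_bit_number(4 * group + k, bit_in_byte) for k in range(4)]

def reorder_channel(slices: list[list[int]]) -> list[int]:
    """Model of channel 1206: arrange all slice outputs in descending bit order."""
    return sorted((bit for s in slices for bit in s), reverse=True)

# Hypothetical emission order: for each 4-byte group, bit 7 down to bit 0.
slices = [slice_bits(g, n) for g in range(5) for n in range(7, -1, -1)]
print(slice_bits(0, 7))             # [159, 151, 143, 135]
print(reorder_channel(slices)[:4])  # [159, 158, 157, 156]
```

Each slice emits four bits that are eight positions apart, which is why a reordering channel is needed before the packet is presented in descending bit order.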

FIG. 13 is a functional diagram of an input byte ordering for each 4-byte group that allows the mux's
select control signals to be shared in accordance with one embodiment of the present invention. For example, for
bytes 0-3 (B0-B3), 4:1 mux A selects bits 0-7 from bytes 0, 8, 16, and 24, 4:1 mux B selects bits 0-7 from bytes
1, 9, 17, and 25, ..., and 4:1 mux H selects bits 0-7 from bytes 7, 15, 23, and 31. Accordingly, the input byte
ordering for each four-byte group advantageously allows the mux selects to be shared as discussed above.
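The byte ordering just described follows a simple stride-of-eight pattern, sketched below in Python. The name `mux_input_bytes` is hypothetical, and the sketch only restates the mapping given in the text (mux A sees bytes 0, 8, 16, 24 and mux H sees bytes 7, 15, 23, 31).

```python
MUX_NAMES = "ABCDEFGH"  # the eight first-level muxes

def mux_input_bytes(mux: str) -> list[int]:
    """Bytes of the 32-byte aligner input that feed the named mux: byte i, i+8, i+16, i+24."""
    i = MUX_NAMES.index(mux)
    return [i + 8 * k for k in range(4)]

for name in MUX_NAMES:
    print(name, mux_input_bytes(name))

# Every one of the 32 input bytes feeds exactly one first-level mux.
covered = sorted(b for name in MUX_NAMES for b in mux_input_bytes(name))
```

Because every mux sees the same stride, the same select signals can address the corresponding input on each of them.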

FIG. 14 is a block diagram of instruction queue 302 shown in greater detail in accordance with one
embodiment of the present invention. Instruction queue 302 is a four-entry instruction queue that provides a
decoupling buffer between instruction fetch unit 108 and processors 110 and 112. As discussed above, every
cycle, instruction fetch unit 108 provides an instruction packet (e.g., a VLIW packet). The instruction packet is
passed onto the processors for execution if the processors are ready for a new instruction. For example, in two
cases, a VLIW packet is produced that cannot be executed immediately. First, if the execution pipeline is stalled
(e.g., for load dependency), then the VLIW packet is written to instruction queue 302. Second, when a pair of
instructions for a particular processor such as the GFU is present, only one GFU instruction can be executed, and
the other GFU instruction is queued in instruction queue 302. When the instruction fetch pipeline is stalled due
to an instruction cache miss, for example, some of the penalty for the instruction fetch pipeline stall can be
hidden by having valid entries buffered in instruction queue 302.
Instruction queue 302 is a four-entry FIFO (First In First Out) queue that can be implemented as a static
register file. Control logic in instruction fetch unit 108 can provide FIFO pointers. The tail entry of instruction
queue 302 can be written with either the output of RAT unit 611 or the second instruction of a GFU pair (e.g.,
RAT_OUT[119:80]). A read can be implemented using a 4:1 mux in instruction queue 302. Thus, bits 159-120 of
instruction queue 302 can be written with either the second instruction of a GFU pair or the output of RAT unit
611. The rest of the bits (i.e., bits 119:0) can be written with the output of RAT unit 611.
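A four-entry FIFO with explicit head and tail pointers can be sketched as follows. This is a minimal software model under stated assumptions, not the register-file implementation of the embodiment: the class name `InstructionQueue` and its methods are hypothetical, and the selective write of bits 159-120 is not modeled.

```python
class InstructionQueue:
    """Software model of a four-entry FIFO with read (head) and write (tail) pointers."""

    DEPTH = 4

    def __init__(self) -> None:
        self.entries = [None] * self.DEPTH
        self.head = 0   # next entry to read
        self.tail = 0   # next entry to write
        self.count = 0  # number of valid entries

    def push(self, packet: bytes) -> bool:
        """Write the tail entry; returns False when the queue is full."""
        if self.count == self.DEPTH:
            return False
        self.entries[self.tail] = packet
        self.tail = (self.tail + 1) % self.DEPTH
        self.count += 1
        return True

    def pop(self):
        """Read the head entry (the 4:1 read mux in hardware); returns None when empty."""
        if self.count == 0:
            return None
        packet = self.entries[self.head]
        self.head = (self.head + 1) % self.DEPTH
        self.count -= 1
        return packet

queue = InstructionQueue()
for n in range(4):
    queue.push(bytes([n]) * 20)  # four dummy 20-byte packets
print(queue.push(b"\xff" * 20))  # False: a fifth packet does not fit
print(queue.pop()[0])            # 0: the oldest packet is read first
```

Buffered entries of this kind are what allow part of an instruction cache miss penalty to be hidden, as described above.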

Although particular embodiments of the present invention have been shown and described, it will be 
obvious to those skilled in the art that changes and modifications can be made without departing from the present 
invention in its broader aspects, and therefore, the appended claims are to encompass within their scope all such 
changes and modifications that fall within the true scope of the present invention. 
TABLE 1 Doubleword Mux Selects

Byte       {PC[5],       Mux A        Mux B        Mux C        Mux D
Offsets    AOR[4:2]}     Sel0 Sel1    Sel0 Sel1    Sel0 Sel1    Sel0 Sel1

0-3        0000          1    0       1    0       1    0       X    X
4-7        0001          1    0       1    0       1    0       1    0
8-11       0010          X    X       1    0       1    0       1    0
12-15      0011          0    1       1    0       1    0       1    0
16-19      0100          0    1       X    X       1    0       1    0
20-23      0101          0    1       0    1       1    0       1    0
24-27      0110          0    1       0    1       X    X       1    0
28-31      0111          0    1       0    1       0    1       1    0
32-35      1000          0    1       0    1       0    1       X    X
36-39      1001          0    1       0    1       0    1       0    1
40-43      1010          X    X       0    1       0    1       0    1
44-47      1011          1    0       0    1       0    1       0    1
48-51      1100          1    0       X    X       0    1       0    1
52-55      1101          1    0       1    0       0    1       0    1
56-59      1110          1    0       1    0       X    X       0    1
60-63      1111          1    0       1    0       1    0       0    1


TABLE 2 Optimized Doubleword Mux Selects

Byte       {PC[5],       Mux A        Mux B        Mux C        Mux D
Offsets    AOR[4:3]}     Sel0 Sel1    Sel0 Sel1    Sel0 Sel1    Sel0 Sel1

0-7        000           1    0       1    0       1    0       1    0
8-15       001           0    1       1    0       1    0       1    0
16-23      010           0    1       0    1       1    0       1    0
24-31      011           0    1       0    1       0    1       1    0
32-39      100           0    1       0    1       0    1       0    1
40-47      101           1    0       0    1       0    1       0    1
48-55      110           1    0       1    0       0    1       0    1
56-63      111           1    0       1    0       1    0       0    1


The equations for the doubleword mux selects based upon the optimization are as follows:

Mux A, Sel0 = (!PC[5] && !AOR[4] && !AOR[3]) || (PC[5] && AOR[4]) || (PC[5] && AOR[3])
       Sel1 = (PC[5] && !AOR[4] && !AOR[3]) || (!PC[5] && AOR[4]) || (!PC[5] && AOR[3])

Mux B, Sel0 = (!PC[5] && !AOR[4]) || (PC[5] && AOR[4])
       Sel1 = (!PC[5] && AOR[4]) || (PC[5] && !AOR[4])

Mux C, Sel0 = (PC[5] && AOR[4] && AOR[3]) || (!PC[5] && !AOR[4]) || (!PC[5] && !AOR[3])
       Sel1 = (!PC[5] && AOR[4] && AOR[3]) || (PC[5] && !AOR[4]) || (PC[5] && !AOR[3])

Mux D, Sel0 = !PC[5]
       Sel1 = PC[5]

TABLE 3
TABLE 5

The logic equations for the mux selects of the 4:1 muxes are as follows:

Muxes A-D, Sel0 = (!AOR[4] && !AOR[3] && !AOR[2]) || (AOR[4] && AOR[3] && AOR[2])
           Sel1 = (!AOR[4] && !AOR[3] && AOR[2]) || (!AOR[4] && AOR[3] && !AOR[2])
           Sel2 = (!AOR[4] && AOR[3] && AOR[2]) || (AOR[4] && !AOR[3] && !AOR[2])
           Sel3 = (AOR[4] && !AOR[3] && AOR[2]) || (AOR[4] && AOR[3] && !AOR[2])

Muxes E-H, Sel0 = !AOR[4] && !AOR[3]
           Sel1 = !AOR[4] && AOR[3]
           Sel2 = AOR[4] && !AOR[3]
           Sel3 = AOR[4] && AOR[3]

The 8:1 mux control is much simpler as a result of the routing of the 4:1 mux outputs.
This routing can be seen in Figure 9. The logic equations for the mux selects of the 8:1 muxes are as follows:

Muxes 0-3, Sel0 = !AOR[2] && !AOR[1] && !AOR[0]
           Sel1 = !AOR[2] && !AOR[1] && AOR[0]
           Sel2 = !AOR[2] && AOR[1] && !AOR[0]
           Sel3 = !AOR[2] && AOR[1] && AOR[0]
           Sel4 = AOR[2] && !AOR[1] && !AOR[0]
           Sel5 = AOR[2] && !AOR[1] && AOR[0]
           Sel6 = AOR[2] && AOR[1] && !AOR[0]
           Sel7 = AOR[2] && AOR[1] && AOR[0]
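The select equations above are one-hot decodes of the address offset register (AOR) bits. The Python sketch below transcribes them and checks that exactly one select is asserted for each of the 32 offsets. All function names are hypothetical, and the compact formulas in the final loop are an observation derived from the equations rather than something stated in the text.

```python
def selects_a_to_d(a4: bool, a3: bool, a2: bool) -> list[bool]:
    """Sel0..Sel3 for 4:1 muxes A-D, transcribed from the equations."""
    return [
        (not a4 and not a3 and not a2) or (a4 and a3 and a2),
        (not a4 and not a3 and a2) or (not a4 and a3 and not a2),
        (not a4 and a3 and a2) or (a4 and not a3 and not a2),
        (a4 and not a3 and a2) or (a4 and a3 and not a2),
    ]

def selects_e_to_h(a4: bool, a3: bool) -> list[bool]:
    """Sel0..Sel3 for 4:1 muxes E-H, transcribed from the equations."""
    return [not a4 and not a3, not a4 and a3, a4 and not a3, a4 and a3]

def selects_8to1(a2: bool, a1: bool, a0: bool) -> list[bool]:
    """Sel0..Sel7 for the 8:1 muxes: Sel k is asserted when AOR[2:0] equals k."""
    return [(a2, a1, a0) == (bool(k & 4), bool(k & 2), bool(k & 1)) for k in range(8)]

def asserted(selects: list[bool]) -> int:
    """Index of the single asserted select; raises if the decode is not one-hot."""
    hot = [i for i, s in enumerate(selects) if s]
    if len(hot) != 1:
        raise ValueError("select signals are not one-hot")
    return hot[0]

for aor in range(32):
    a4, a3, a2, a1, a0 = (bool(aor >> b & 1) for b in (4, 3, 2, 1, 0))
    assert asserted(selects_a_to_d(a4, a3, a2)) == (((aor >> 2) + 1) // 2) % 4
    assert asserted(selects_e_to_h(a4, a3)) == aor >> 3
    assert asserted(selects_8to1(a2, a1, a0)) == aor & 0b111
print("all 32 offsets decode to exactly one select per mux")
```

The loop shows that muxes E-H follow AOR[4:3] directly, while muxes A-D advance to the next input half an 8-byte group earlier.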
WE CLAIM

1. An apparatus for an instruction fetch unit aligner of a microprocessor, comprising:
    selection logic of an instruction aligner that extracts and aligns a non-power of two size instruction from
    power of two size instruction data; and
    control logic of the instruction aligner for controlling the selection logic.

2. The apparatus of Claim 1 wherein the selection logic comprises:
    multiplexer logic for selecting the non-power of two size instruction from the power of two size
    instruction data.

3. The apparatus of Claim 2 wherein the extraction and alignment of the non-power of two size
instruction from the power of two size instruction data is performed within one clock cycle of the
microprocessor.

4. The apparatus of Claim 3 wherein the multiplexer logic comprises eight 4:1 multiplexers and
four 8:1 multiplexers for every 4 bits of the power of two size instruction data.

5. The apparatus of Claim 4 wherein the multiplexer logic further comprises:
    four 2:1 multiplexers that each select 8 bytes to provide the power of two size instruction data.

6. The apparatus of Claim 4 further comprising a routing channel that reorders the bits output
from the multiplexer logic.

7. The apparatus of Claim 1 wherein the non-power of two size instruction comprises a Very
Long Instruction Word (VLIW) packet.

8. The apparatus of Claim 1 wherein the power of two size instruction data comprises 32 bytes of
data, and the non-power of two size instruction comprises 5, 10, 15, or 20 bytes of instruction data.

9. The apparatus of Claim 1 wherein the non-power of two size instruction comprises an
instruction packet that comprises a packet header, the packet header indicating a number of instructions in the
instruction packet.

10. An apparatus for an instruction fetch unit aligner of a microprocessor, comprising:
    selection logic of the instruction aligner that extracts and aligns a non-power of two size instruction
    from power of two size instruction data; and
    control logic of the instruction aligner for controlling the selection logic, wherein the control logic
    comprises a decoder.

11. The apparatus of Claim 10 wherein the selection logic comprises:
    multiplexer logic for selecting the non-power of two size instruction from the power of two size
    instruction data.

12. The apparatus of Claim 11 wherein the extraction and alignment of the non-power of two size
instruction from the power of two size instruction data is performed within one clock cycle of the
microprocessor.

13. The apparatus of Claim 12 wherein the multiplexer logic comprises eight 4:1 multiplexers and
four 8:1 multiplexers for every 4 bits of the power of two size instruction data.

14. The apparatus of Claim 13 wherein the multiplexer logic further comprises:
    four 2:1 multiplexers that each select 8 bytes to provide the power of two size instruction data.

15. The apparatus of Claim 13 further comprising a reorder channel that reorders the bits output
from the multiplexer logic.

16. The apparatus of Claim 10 wherein the non-power of two size instruction comprises a Very
Long Instruction Word (VLIW) packet.

17. The apparatus of Claim 10 wherein the control logic is optimized.

18. The apparatus of Claim 10 wherein the power of two size instruction data comprises 32 bytes
of instruction data.

19. The apparatus of Claim 10 wherein the instruction aligner is implemented as a megacell.

20. The apparatus of Claim 10 wherein the non-power of two size instruction comprises an
instruction packet that comprises a packet header, the packet header indicating a number of instructions in the
instruction packet.

DISCLOSURE OF INVENTION 

The present invention provides an instruction fetch unit aligner. For example, the present invention 
provides a cost-effective and high performance apparatus for an instruction fetch unit of a microprocessor that 
executes instructions having a non-power of two size. 

In one embodiment, an apparatus for an instruction fetch unit aligner includes selection logic of an
instruction aligner that extracts and aligns a non-power of two size instruction (e.g., 5, 10, 15, or 20 bytes of
instruction data) from power of two size instruction data (e.g., 64 bytes of instruction data), and control logic of
the instruction aligner for controlling the selection logic. The selection logic is implemented as multiplexer logic
for selecting the non-power of two size instruction from the power of two size instruction data. The extraction
and alignment of the non-power of two size instruction from the power of two size instruction data is performed
within one clock cycle of the microprocessor. For example, four 2:1 multiplexers that each select 8 bytes of the
power of two size instruction data can be used to select 32 bytes of instruction data from 64 bytes of instruction
data in which the non-power of two size instruction is within the selected 32 bytes of instruction data, and the
multiplexer logic provides 32:1 mux functionality using eight 4:1 multiplexers and four 8:1 multiplexers for
every 4 bits of the power of two size instruction data. A reorder channel that appropriately reorders the bits
output from the multiplexer logic is also provided.

Other aspects and advantages of the present invention will become apparent from the following detailed 
description and accompanying drawings. 
BRIEF DESCRIPTION OF DRAWINGS

FIG. 1 is a block diagram of a microprocessor that includes an instruction fetch unit in accordance with
one embodiment of the present invention.

FIG. 2 shows various formats of instructions having a non-power of two size.

FIG. 3 is a block diagram of an instruction queue and the instruction fetch unit of FIG. 1 shown in
greater detail in accordance with one embodiment of the present invention.

FIG. 4 is a functional diagram of the instruction cache unit of FIG. 1 connected to the instruction fetch
unit of FIG. 1 in accordance with one embodiment of the present invention.

FIG. 5 is a diagram of possible 5-byte instruction positions within a 32-byte wide cache memory.

FIG. 6 is a functional diagram of the operation of the instruction fetch unit of FIG. 4 shown in greater
detail in accordance with one embodiment of the present invention.

FIG. 7 is a functional diagram of a multi-level implementation of the instruction aligner of FIG. 3 in
accordance with one embodiment of the present invention.

FIG. 8 is a block diagram of the line buffers connected to the doubleword muxes of the instruction fetch
unit of FIG. 6 shown in greater detail in accordance with one embodiment of the present invention.

FIG. 9 is a functional diagram of the operation of the rotate and truncate (RAT) unit of FIG. 6 shown in
greater detail in accordance with one embodiment of the present invention.

FIG. 10 is a functional diagram of a symbolic implementation of the RAT unit of FIG. 6 in accordance
with one embodiment of the present invention.

FIG. 11 is a functional diagram of a RAT bit ordering in accordance with one embodiment of the
present invention.

FIG. 12 is a functional diagram of a RAT physical implementation in accordance with one embodiment
of the present invention.

FIG. 13 is a functional diagram of an input byte ordering of each four-byte group that allows the mux's
select control signals to be shared in accordance with one embodiment of the present invention.

FIG. 14 is a block diagram of the instruction queue of FIG. 3 shown in greater detail in accordance with
one embodiment of the present invention.





WO 00/033180 



PCT/US99/28873 



MODES FOR CARRYING OUT THE INVENTION 

A typical instruction set architecture (ISA) for a microprocessor specifies instructions that have a power of
two size, which can be aligned on a power of two boundary in a conventional cache memory. A typical ISA
includes 32-bit instructions that are a fixed size such as for RISC (Reduced Instruction Set Computer) processors.
The 32-bit instructions are typically aligned on a 32-bit boundary in a conventional instruction cache unit. The
32-bit instructions can be prefetched from the instruction cache unit in one clock cycle using a conventional
32-bit data path between the prefetch unit and the instruction cache unit.

However, new instruction set architectures may include instructions having a non-power of two size. To
efficiently fetch instructions having a non-power of two size, a method in accordance with one embodiment of
the present invention includes fetching at least two sequential cache lines for storage in line buffers of an
instruction fetch unit of a microprocessor, and then efficiently extracting and aligning all the bytes of a
non-power of two size instruction from the line buffers. This approach allows for a standard instruction cache
architecture, which aligns cache lines on a power of two boundary, to be used. This approach also reduces the
data path between the instruction cache and the instruction fetch unit. This approach sustains a fetch of always at
least one sequential instruction per clock cycle of the microprocessor.

For example, an ISA can require supporting execution of instruction packets such as VLIW (Very Long
Instruction Word) packets that are either 5, 10, 15, or 20 bytes wide. For certain applications such as graphics or
media code, there may predominantly be 20-byte wide VLIW packets. If a 20-byte VLIW packet is executed per
clock cycle (e.g., at a peak execution rate), then to maintain this peak execution rate, the instruction fetch unit
fetches at least 20 bytes per clock cycle from the instruction cache unit.

FIG. 1 is a block diagram of a microprocessor 100 that includes an instruction fetch unit (IFU) 108 in
accordance with one embodiment of the present invention. In particular, microprocessor 100 includes a main
memory 102 connected to a bus 104, an instruction cache unit 106 connected to bus 104, instruction fetch unit
108 connected to instruction cache unit 106, and P1 processor 110 and P2 processor 112 each connected to
instruction fetch unit 108. In one embodiment, only P1 processor 110 is provided (i.e., instead of P1 processor 110
and P2 processor 112), and P1 processor 110 is connected to instruction fetch unit 108.

In one embodiment, instruction cache unit 106 is a conventional 16-kilobyte dual-ported cache that uses
a well-known (standard) cache architecture of two-way set associative, 32-byte lines (e.g., in order to minimize
cost and timing risk). Instruction cache unit 106 returns a new 32-byte cache line to instruction fetch unit 108
during each clock cycle of microprocessor 100, and thus, instruction cache unit 106 can satisfy an execution rate
of, for example, a 20-byte VLIW packet per clock cycle of microprocessor 100.

However, the 20-byte VLIW packets may not be aligned on the 32-byte cache line boundaries of
instruction cache unit 106. VLIW packets can start on any byte boundary, and an empirical observation reveals
that a significant number of the VLIW packets often start on a first cache line and continue onto a second cache
line of two sequential cache lines. For VLIW packets that span two cache lines, two clock cycles would typically
be needed to fetch the entire VLIW packet before executing the VLIW packet. As a result, the throughput of the
execution pipeline of microprocessor 100 may be reduced to approximately one half, thus resulting in a significant
performance degradation.

Accordingly, instruction fetch unit 108 stores two instruction cache lines fetched from instruction cache
unit 106 to ensure that instruction fetch unit 108 can provide the next VLIW packet, regardless of whether or not
the VLIW packet spans two cache lines, in a single clock cycle. In particular, instruction fetch unit 108
prefetches ahead of execution, predicts branch outcomes, and maintains two sequential cache lines of unexecuted
instructions. For example, a 20-byte VLIW packet is extracted from the two sequential instruction cache lines of
instruction fetch unit 108 and then appropriately aligned, and the extraction and alignment is completed in one
clock cycle (assuming the two sequential cache lines stored in instruction fetch unit 108 represent valid data).
For sequential execution, instruction fetch unit 108 provides at least one VLIW packet per clock cycle, regardless
of whether or not the VLIW packet spans two cache lines in instruction cache unit 106.
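The effect of holding two sequential cache lines can be illustrated with a behavioral Python sketch of the extraction. It models the result only, not the mux hardware. The function name `extract_packet` is hypothetical, and the wrap-around from byte 63 to byte 0 is an assumption reflecting that either line buffer may hold the earlier of the two sequential lines.

```python
LINE_BYTES = 32  # one instruction cache line

def extract_packet(line0: bytes, line1: bytes, start: int, length: int) -> bytes:
    """Return `length` sequential bytes beginning at byte `start` of the 64-byte window."""
    if len(line0) != LINE_BYTES or len(line1) != LINE_BYTES:
        raise ValueError("each line buffer holds exactly 32 bytes")
    if not 0 <= start < 2 * LINE_BYTES or length not in (5, 10, 15, 20):
        raise ValueError("start must be 0-63 and length one of 5, 10, 15, 20")
    window = line0 + line1
    # Assumption: indices wrap modulo 64 so a packet may begin in line buffer 1
    # and continue in line buffer 0.
    return bytes(window[(start + i) % len(window)] for i in range(length))

line0 = bytes(range(0, 32))
line1 = bytes(range(32, 64))
packet = extract_packet(line0, line1, start=25, length=20)
print(list(packet))  # bytes 25 through 44: the packet spans both lines
```

A packet that starts at byte 25 crosses the line boundary, yet it is produced in one step because both lines are already buffered.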

In one embodiment, instruction cache unit 106 is a shared instruction cache unit for multiple processors
(e.g., P1 processor 110 and P2 processor 112).

A typical instruction fetch unit provides a 4-byte granularity. In contrast, instruction fetch unit 108
provides a 1-byte granularity, that is, it can fetch instructions that start on any byte boundary. Instruction fetch
unit 108 extracts and aligns a 5, 10, 15, or 20 byte VLIW packet from 64 bytes of instruction data stored in
instruction fetch unit 108 (e.g., two 32-byte instruction cache lines of instruction cache unit 106). Instruction
fetch unit 108 efficiently performs the align operation as discussed below.

FIG. 2 shows various formats of instructions having a non-power of two size. In particular, instruction
format 202 shows an instruction format for a variable size opcode which includes an 8-bit to 16-bit opcode, a
6-bit to 10-bit destination, a 6-bit to 10-bit source 1, a 6-bit to 10-bit source 2, and a 6-bit to 10-bit source 3.
Format 202 ranges from 32 bits to 56 bits. Instruction format 204 shows a 40-bit instruction format which
includes an 8-bit opcode, an 8-bit destination, an 8-bit source 1, an 8-bit source 2, and an 8-bit source 3.
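Instruction format 204 can be sketched as a 5-byte record in Python. The field widths come from the text. The byte order (opcode first), the field names, and the function names are assumptions made for illustration only.

```python
FIELDS = ("opcode", "dest", "src1", "src2", "src3")  # five 8-bit fields = 40 bits

def pack_format_204(opcode: int, dest: int, src1: int, src2: int, src3: int) -> bytes:
    """Pack five 8-bit fields into 5 bytes (assumed order: opcode first)."""
    # bytes() raises ValueError if any field is outside 0-255.
    return bytes([opcode, dest, src1, src2, src3])

def unpack_format_204(raw: bytes) -> dict[str, int]:
    """Split a 5-byte instruction back into its named fields."""
    if len(raw) != len(FIELDS):
        raise ValueError("format 204 instructions are exactly 5 bytes")
    return dict(zip(FIELDS, raw))

raw = pack_format_204(0x12, 1, 2, 3, 4)
print(len(raw) * 8)            # 40
print(unpack_format_204(raw))  # {'opcode': 18, 'dest': 1, 'src1': 2, 'src2': 3, 'src3': 4}
```

The same arithmetic gives the range of format 202: 8 + 4 x 6 = 32 bits at the minimum and 16 + 4 x 10 = 56 bits at the maximum.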

Storing non-power of two size instructions, such as shown in instruction format 204, in a conventional
DRAM (Dynamic Random Access Memory) or other conventional cache memory that includes cache lines of
power of two size (e.g., because of binary addressing) results in non-aligned instructions being stored in the
instruction cache. Thus, one embodiment of the present invention allows for the fetching of non-power of two
size instructions from an instruction cache unit in one clock cycle of the microprocessor. For example, a typical
DRAM has a width of a power of two number of bits (e.g., 32 bytes). Similarly, on-chip memory is typically
organized using power of two boundaries and addressing. Thus, non-power of two instruction sets, such as
shown in instruction format 204 (i.e., a forty-bit or five-byte instruction), are not necessarily aligned when
stored in instruction cache unit 106.

FIG. 3 is a block diagram of an instruction queue 302 and instruction fetch unit 108 shown in greater
detail in accordance with one embodiment of the present invention. Instruction fetch unit 108 is connected to
instruction cache unit 106 via a conventional 32-byte data path. Instruction fetch unit 108 includes a prefetch
unit 304. Prefetch unit 304 includes dual in-line buffers 306. Dual in-line buffers 306 are implemented as, for
example, two 32-byte wide registers. Dual in-line buffers 306 store two sequential lines of instructions fetched
from instruction cache unit 106. By storing two sequential lines of instructions fetched from instruction cache
unit 106, instruction fetch unit 108 essentially ensures the subsequent instruction is stored in dual in-line
buffers 306, regardless of whether or not it represents a non-aligned instruction (e.g., the instruction spans two
lines in instruction cache unit 106). Thus, instruction fetch unit 108 solves the problem of having to request two
instruction fetches from instruction cache unit 106, which typically causes a waste of at least one clock cycle of
the microprocessor.

Instruction fetch unit 108 also includes an instruction aligner 308. Instruction aligner 308 extracts and
aligns the non-power of two size instruction from instruction data stored in dual in-line buffers 306. For
example, for a 40-bit instruction, instruction aligner 308 extracts the 40-bit instruction from the 64 bytes of data
stored in dual in-line buffers 306. Instruction aligner 308 then efficiently aligns the 40-bit instruction, as further
discussed below.

In one embodiment, microprocessor 100 includes four processors or CPUs (Central Processing Units).
Microprocessor 100 executes up to four instructions per cycle. Instruction fetch unit 108 provides up to four
instructions per cycle to instruction queue 302 to maintain the peak execution rate of four instructions per cycle.
For example, for a 40-bit instruction set, which defines 40-bit instruction sizes, instruction fetch unit 108
provides up to 160 bits per cycle in order to provide four instructions per cycle. Thus, instruction fetch unit 108
provides up to 20 bytes of instruction data (e.g., a 20-byte VLIW packet) to instruction queue 302 per cycle.
Because dual in-line buffers 306 store 64 bytes of instruction data, instruction aligner 308 is responsible for
extracting and appropriately aligning, for example, the 20 bytes of instruction data for the next cycle that is
within the 64 bytes of instruction data stored in dual in-line buffers 306. Accordingly, in one embodiment, an
efficient method for fetching instructions having a non-power of two size is provided.

FIG. 4 is a functional diagram of instruction cache unit 106 connected to instruction fetch unit 108 in
accordance with one embodiment of the present invention. A cache line 402 that includes 32 bytes of instruction
data is fetched from instruction cache unit 106.
Instruction fetch unit 108 includes dual in-line buffers 306. Dual in-line buffers 306 include a line buffer 0 that
is 32-bytes wide and a line buffer 1 that is 32-bytes wide. For example, line buffer 0 and line buffer 1 can be
implemented as registers of instruction fetch unit 108, or line buffer 0 and line buffer 1 of dual in-line buffers
306 can be implemented as two sets of enable-reset flip-flops, in which the flip-flops can be stacked (two in one
bit slice). The 32-bytes of data are then extracted from dual in-line buffers 306 and transmitted via a 32-byte
data path 406 to instruction aligner 308. Instruction aligner 308 extracts and aligns the instruction (e.g., 10 bytes
of instruction data) from the 32 bytes of instruction data and then transmits the extracted and aligned instruction
for appropriate execution on processors 110 and 112 of microprocessor 100.

Dual in-line buffers 306 maintain two sequential lines of instruction data fetched from instruction cache
unit 106. After the instruction data is extracted from dual in-line buffers 306, instruction fetch unit 108 fetches
the next sequential line of instruction data for storage in dual in-line buffers 306. For example, based on the
address of the fetched data (e.g., if the fifth address bit is zero, then the fetched data is loaded into line buffer 0,
else the fetched data is loaded into line buffer 1), either line buffer 0 or line buffer 1 is purged, and the next
sequential line of cache memory (e.g., cache line 402 of instruction cache unit 106) is fetched and stored in the
now purged line buffer 0 or line buffer 1. In steady state mode, instruction fetch unit 108 maintains a rate of
fetching of 32 bytes of instruction data per cycle. Because only up to 20 bytes of instruction data are consumed
per cycle in the 20-byte VLIW packet example, and instruction data is stored in memory sequentially, instruction
fetch unit 108 can generally satisfy the peak execution rate of microprocessor 100, such as 20 bytes of instruction
data or four instructions per multi-processor cycle of microprocessor 100.
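The line buffer selection rule can be written as a one-line Python function. The sketch assumes that "the fifth address bit" means address bit 5 (the bit with weight 32, written PC[5] elsewhere in this document), which is the natural choice for 32-byte lines but is an interpretation. The function name is hypothetical.

```python
def target_line_buffer(address: int) -> int:
    """Line buffer (0 or 1) that receives the cache line containing `address`.

    Assumption: the deciding bit is address bit 5, so consecutive 32-byte
    lines alternate between the two buffers.
    """
    return (address >> 5) & 1

for line_start in (0, 32, 64, 96):
    print(line_start, "-> line buffer", target_line_buffer(line_start))
```

With this rule two sequential cache lines never compete for the same buffer, so the pair of buffers always holds a contiguous 64-byte window.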

The instruction data path within instruction fetch unit 108 involves, for example, selecting a 20-byte
wide byte-aligned field from 64 bytes of data stored in dual in-line buffers 306. The 20-byte wide byte-aligned
field is buffered (e.g., stored in instruction queue 302) and then appropriately presented to the CPUs (e.g., 4
different processors). For a 20-byte VLIW packet, the data path size between instruction cache unit 106 and
instruction fetch unit 108 can be 32 bytes, because the cache line size is 32 bytes.

However, efficiently extracting a 20-byte wide byte-aligned field from 64 bytes of non-aligned instruction data
represents a challenging problem. Accordingly, instruction fetch unit 108 efficiently performs a rotate
and truncate (RAT) of a 20-byte wide byte-aligned field from 64 bytes of non-aligned instruction data, in which,
for example, 20 bytes is the maximum size of a VLIW packet, and 64 bytes of instruction data is prefetched from
instruction cache unit 106 in accordance with one embodiment of the present invention, as further discussed
below.

FIG. 5 is a diagram of possible 5-byte instruction positions within a 32-byte wide cache memory. Each
32-byte aligned location is called a cache memory line. An instruction can be located in 32 unique positions in
the five cache memory lines (e.g., cache memory lines 0-5) before the position sequence of FIG. 5 repeats.
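The figure of 32 unique positions follows from the fact that 5 and 32 share no common factor, so the pattern of start offsets repeats only after 160 bytes (five cache lines). The Python sketch below checks this arithmetic; all variable names are hypothetical.

```python
from math import gcd

INSTRUCTION_BYTES = 5
LINE_BYTES = 32

# Least common multiple: the number of bytes after which the offset pattern repeats.
period_bytes = INSTRUCTION_BYTES * LINE_BYTES // gcd(INSTRUCTION_BYTES, LINE_BYTES)

# Start offset within a cache line of each back-to-back 5-byte instruction in one period.
starts = [(INSTRUCTION_BYTES * n) % LINE_BYTES for n in range(period_bytes // INSTRUCTION_BYTES)]

unique_positions = len(set(starts))
lines_per_period = period_bytes // LINE_BYTES
straddling = sum(1 for s in starts if s + INSTRUCTION_BYTES > LINE_BYTES)

print(period_bytes, unique_positions, lines_per_period, straddling)  # 160 32 5 4
```

Four of the 32 positions straddle a cache line boundary, and these are the cases that need both line buffers.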

In one embodiment, instruction aligner 308 can select an instruction from any one of these 32 different 
positions along with 0-3 subsequent instructions (assuming a VLIW packet that includes up to four instructions). 
In order to accomplish this task, instruction aligner 308 uses a 5-bit offset pointer indicating where in the 32-byte 
[...] (GFU) instruction is found for a multiprocessor that [...] buffers 306.

[...] line buffer 0 and line buffer 1 of dual in-line buffers 306 [...] 2:1 doubleword muxes to concatenate any
four doublewords [...] doublewords (32 bytes) provide the data needed to extract an instruction [...] a maximum of
20 bytes (assuming both of the line buffers include valid data).

[...] from dual in-line buffers 306 to provide aligner input 610 [...] word muxes to provide RAT output 612.

[...] aligned VLIW packet.

Referring to the selection of bytes of instruction data stored in dual in-line buffers 306, the selection is
performed by using the known start address of the VLIW packet, and then extracting the next sequential bytes
using doubleword muxes 602, 604, 606, and 608 to provide 32-byte aligner input 610. For example, a VLIW
packet can be 5, 10, 15, or 20 bytes (e.g., it depends on whether or not the compiler generated 1, 2, 3, or 4
instructions in parallel, that is, for execution in a single cycle on the multi-processor), in which the first two bits
of the VLIW packet represent a packet header that indicates how many instructions are included in the VLIW
packet. Thus, when a VLIW packet is decoded, it can be determined that only 10 bytes of instruction data are
needed (e.g., two instructions were compiled for execution in parallel in a particular cycle).
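A decoder for the two-bit packet header might look like the following Python sketch. The text says only that the first two bits indicate the number of instructions. The sketch assumes that these are the two most significant bits of the first byte and that a header value h encodes h + 1 instructions; both the encoding and the function name are hypothetical.

```python
INSTRUCTION_BYTES = 5

def packet_length(first_byte: int) -> int:
    """Packet length in bytes, derived from an assumed header encoding.

    Assumptions: the header occupies the top two bits of the first byte,
    and header value h means h + 1 instructions (1 to 4).
    """
    if not 0 <= first_byte <= 0xFF:
        raise ValueError("first_byte must be a single byte value")
    header = (first_byte >> 6) & 0b11
    return INSTRUCTION_BYTES * (header + 1)

for first_byte in (0x00, 0x40, 0x80, 0xC0):
    print(hex(first_byte), packet_length(first_byte))
```

Whatever the actual encoding, two bits are enough to distinguish the four possible packet sizes of 5, 10, 15, and 20 bytes.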

Aligner input 610 represents 32 bytes of instruction data within which resides up to 20 bytes of
non-aligned VLIW data. RAT unit 611 performs a RAT operation that extracts and aligns non-power of two size
instruction data from the power of two size instruction data (e.g., a 20-byte VLIW packet from 32 bytes of
aligner input 610) to provide RAT output 612. The RAT operation can be implemented using twenty 32:1 muxes
using two levels of muxes, eight 4:1 muxes, each of which connects to an 8:1 mux to effectively provide a 32:1
mux, which represents a brute force approach. However, a more efficient approach is discussed below.

FIG. 7 is a functional diagram of a multi-level implementation of instruction aligner 308 in accordance 
with one embodiment of the present invention. In particular, instruction aligner 308 is implemented using two 
levels of muxes, which includes a first level mux select 802 and a second level mux select 804. The first level of 
muxes includes eight 4:1 byte-wide muxes. The second level of muxes includes an 8:1 byte-wide mux. 
Logically, there is a two-level mux structure for each bit of the 20 bytes input to instruction aligner 308. Mux 
select controls 802 and 804 are updated every cycle in order to sustain alignment of one VLIW packet per cycle. 
For example, instruction aligner 308 can be implemented as a megacell that is organized with a stacked bit cell 
placement. 

FIG. 8 is a block diagram of dual in-line buffers 306 connected to doubleword muxes 602, 604, 606,
and 608 shown in greater detail in accordance with one embodiment of the present invention. Doubleword
muxes 602, 604, 606, and 608 select 32 bytes out of the 64 bytes stored in dual in-line buffers 306, which include
line buffer 0 (32 bytes) and line buffer 1 (32 bytes). The 32 bytes of data selected by doubleword muxes 602,
604, 606, and 608 are then transmitted to RAT unit 611 of the instruction data path as discussed above with
respect to FIG. 6. Doubleword muxes 602, 604, 606, and 608 are essentially 2:1 muxes that select a doubleword
(8 bytes) from either line buffer 0 (even octword) or line buffer 1 (odd octword). Doubleword muxes 602, 604,
606, and 608 are used to take advantage of the fact that at most 20 bytes of the 32 bytes of instruction data will
be used. The granularity of the muxes may be set to any size down to single-byte granularity. The doubleword
granularity is chosen based upon simplification of truth tables as shown in Tables 1 and 2 in accordance with one
embodiment of the present invention.
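
The reasoning behind the doubleword selects can be sketched in Python: for each group of four start offsets, determine which line buffer each of the four 2:1 doubleword muxes must select so that a packet of up to 20 bytes is covered, and mark a mux as a don't care when none of its bytes is needed. The function name is hypothetical, and the sketch assumes that addresses wrap modulo 64 across the two line buffers and that mux A handles doubleword 0 of either buffer, mux B doubleword 1, and so on.

```python
PACKET_MAX = 20  # largest VLIW packet, in bytes
DW = 8           # doubleword size, in bytes


def required_line_buffers(index):
    """index = {PC[5], AOR[4:2]} (0-15), covering start offsets 4*index .. 4*index+3.

    Returns, for doubleword muxes A-D, the line buffer (0 or 1) that must be
    selected, or 'X' when that mux output is never read (don't care).
    """
    need = ["X"] * 4
    for start in range(4 * index, 4 * index + 4):
        for i in range(PACKET_MAX):
            dw = ((start + i) % 64) // DW  # doubleword 0-7 across both buffers
            need[dw % 4] = dw // 4         # mux = dw % 4, line buffer = dw // 4
    return need


print(required_line_buffers(0))  # [0, 0, 0, 'X']
print(required_line_buffers(3))  # [1, 0, 0, 0]
```

In this model a result of 0 corresponds to asserting Sel0 and a result of 1 to asserting Sel1, which is the form in which Table 1 lists the selects.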
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the RAT bit ordering as shown in FIG. 11. For example, a 14.34 um (microns) pitch can be used instead of a 25
um pitch, which translates into a savings of about 750 um in the width of the instruction data path.

FIG. 12 is a functional diagram of a RAT physical implementation in accordance with one embodiment
of the present invention. The input to the RAT physical implementation of FIG. 12 is identical for
each of the eight 4:1 muxes. By recognizing that the inputs to the eight 4:1 muxes are the same (i.e., the
same 32 bytes of data), each of the eight 4:1 muxes can be implemented as shown in a block 1204. Block 1204
shows eight 4:1 muxes (A, B, C, D, E, F, G, and H) and four 8:1 muxes (0, 1, 2, and 3). Each of the eight 4:1
muxes is set or controlled in order to output a particular bit n of each selected byte. For example, block 1204
outputs bit 7 of a 4-byte group 1202, and thus, block 1204 outputs bit 7 of bytes 0, 1, 2, and 3, which represents
an output of bit 159, bit 151, bit 143, and bit 135. The output is then sent to a channel 1206, which reorders the
bits into descending order. For example, assuming 20 bytes of instruction data, such as a 20-byte VLIW packet,
channel 1206 reorders the 160 bits or 20 bytes of data from bit number 159 in descending order to bit number 0.
Because not all of the outputs of the eight 4:1 muxes are necessarily selected, "do not care" conditions can be
provided in the mux selection or control logic. Thus, this embodiment enables some combination of the 4:1 mux
selects. A truth table for the mux select control signals is shown in Tables 3-6 in accordance with one
embodiment of the present invention. Further, the controls of the muxes are generated based upon the offset in
the address offset register (not shown). The controls for each 4:1 mux (muxes A, B, C, D, E, F, G, and H) and
each 8:1 mux (0, 1, 2, and 3) can be shared across the entire RAT unit if the bits are ordered carefully.
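
The bit numbering in the example above (bit 7 of bytes 0, 1, 2, and 3 corresponding to bits 159, 151, 143, and 135) can be checked with a short Python sketch. The function names are hypothetical, and the sketch models only the numbering and the reordering performed by the channel, not the physical routing.

```python
PACKET_BYTES = 20


def bit_number(byte_index, bit_in_byte):
    """Packet-wide bit number when bit 7 of byte 0 is bit 159."""
    return 8 * (PACKET_BYTES - 1 - byte_index) + bit_in_byte


def block_output(group, bit_in_byte):
    """Bit numbers produced by one block for a 4-byte group (one bit per byte)."""
    return [bit_number(4 * group + i, bit_in_byte) for i in range(4)]


def channel_reorder(bit_numbers):
    """Model of the channel: put the collected bits into descending order."""
    return sorted(bit_numbers, reverse=True)


print(block_output(0, 7))  # [159, 151, 143, 135]
```

Collecting the outputs of all blocks (five 4-byte groups, eight bit positions each) and passing them through `channel_reorder` yields bit 159 down to bit 0.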

FIG. 13 is a functional diagram of an input byte ordering for each 4-byte group that allows the mux
select control signals to be shared in accordance with one embodiment of the present invention. For example, for
bytes 0-3 (B0-B3), 4:1 mux A selects bits 0-7 from bytes 0, 8, 16, and 24, 4:1 mux B selects bits 0-7 from bytes
1, 9, 17, and 25, and 4:1 mux H selects bits 0-7 from bytes 7, 15, 23, and 31. Accordingly, the input byte
ordering for each four-byte group advantageously allows the mux selects to be shared as discussed above.

FIG. 14 is a block diagram of instruction queue 302 shown in greater detail in accordance with one
embodiment of the present invention. Instruction queue 302 is a four-entry instruction queue that provides a
decoupling buffer between instruction fetch unit 108 and processors 110 and 112. As discussed above, every
cycle, instruction fetch unit 108 provides an instruction packet (e.g., a VLIW packet). The instruction packet is
passed on to the processors for execution if the processors are ready for a new instruction. For example, in two
cases, a VLIW packet is produced that cannot be executed immediately. First, if the execution pipeline is stalled
(e.g., for load dependency), then the VLIW packet is written to instruction queue 302. Second, when a pair of
instructions for a particular processor such as the GFU is present, only one GFU instruction can be executed, and
the other GFU instruction is queued in instruction queue 302. When the instruction fetch pipeline is stalled due
to an instruction cache miss, for example, some of the penalty for the instruction fetch pipeline stall can be
hidden by having valid entries buffered in instruction queue 302.
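
The decoupling behavior of the four-entry queue can be sketched as a simple behavioral model in Python. The class and method names are hypothetical. The bypass when the queue is empty and the hold-off of the fetch unit when the queue is full are assumptions made for illustration, and the GFU pairing case is not modeled.

```python
from collections import deque


class InstructionQueue:
    """Four-entry decoupling buffer between fetch and execute (behavioral sketch)."""

    def __init__(self, capacity=4):
        self.capacity = capacity
        self.entries = deque()

    def cycle(self, fetched=None, processors_ready=True):
        """Advance one cycle and return the packet issued, or None.

        fetched: packet delivered by the fetch unit this cycle
                 (None models a fetch stall, e.g. an instruction cache miss).
        """
        issued = None
        if processors_ready:
            if self.entries:
                issued = self.entries.popleft()  # oldest buffered packet first
            elif fetched is not None:
                return fetched                   # assumption: bypass when empty
        if fetched is not None:
            # Assumption: the fetch unit is held off while the queue is full.
            if len(self.entries) >= self.capacity:
                raise OverflowError("queue full; fetch should have stalled")
            self.entries.append(fetched)
        return issued


q = InstructionQueue()
print(q.cycle("P0", processors_ready=True))   # P0
print(q.cycle("P1", processors_ready=False))  # None
print(q.cycle(None, processors_ready=True))   # P1
```

The last call shows the effect described in the text: a packet buffered during an execution stall is issued during a cycle in which the fetch pipeline delivers nothing.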
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TABLE 1 Doubleword Mux Selects 







Byte      {PC[5],       Mux A        Mux B        Mux C        Mux D
Offsets   AOR[4:2]}   Sel0 Sel1    Sel0 Sel1    Sel0 Sel1    Sel0 Sel1
0-3       0000         1    0       1    0       1    0       X    X
4-7       0001         1    0       1    0       1    0       1    0
8-11      0010         X    X       1    0       1    0       1    0
12-15     0011         0    1       1    0       1    0       1    0
16-19     0100         0    1       X    X       1    0       1    0
20-23     0101         0    1       0    1       1    0       1    0
24-27     0110         0    1       0    1       X    X       1    0
28-31     0111         0    1       0    1       0    1       1    0
32-35     1000         0    1       0    1       0    1       X    X
36-39     1001         0    1       0    1       0    1       0    1
40-43     1010         X    X       0    1       0    1       0    1
44-47     1011         1    0       0    1       0    1       0    1
48-51     1100         1    0       X    X       0    1       0    1
52-55     1101         1    0       1    0       0    1       0    1
56-59     1110         1    0       1    0       X    X       0    1
60-63     1111         1    0       1    0       1    0       0    1





TABLE 2 Optimized Doubleword Mux Selects 







Byte      {PC[5],       Mux A        Mux B        Mux C        Mux D
Offsets   AOR[4:3]}   Sel0 Sel1    Sel0 Sel1    Sel0 Sel1    Sel0 Sel1
0-7       000          1    0       1    0       1    0       1    0
8-15      001          0    1       1    0       1    0       1    0
16-23     010          0    1       0    1       1    0       1    0
24-31     011          0    1       0    1       0    1       1    0
32-39     100          0    1       0    1       0    1       0    1
40-47     101          1    0       0    1       0    1       0    1
48-55     110          1    0       1    0       0    1       0    1
56-63     111          1    0       1    0       1    0       0    1



The equations for the doubleword mux selects based upon the optimization are as follows: 



Mux A, Sel0 = (!PC[5] && !AOR[4] && !AOR[3]) || (PC[5] && AOR[4]) || (PC[5] && AOR[3])
       Sel1 = (PC[5] && !AOR[4] && !AOR[3]) || (!PC[5] && AOR[4]) || (!PC[5] && AOR[3])

Mux B, Sel0 = (!PC[5] && !AOR[4]) || (PC[5] && AOR[4])
       Sel1 = (!PC[5] && AOR[4]) || (PC[5] && !AOR[4])
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TABLE 3 



Mux C, Sel0 = (PC[5] && AOR[4] && AOR[3]) || (!PC[5] && !AOR[4]) || (!PC[5] && !AOR[3])
       Sel1 = (!PC[5] && AOR[4] && AOR[3]) || (PC[5] && !AOR[4]) || (PC[5] && !AOR[3])

Mux D, Sel0 = !PC[5]
       Sel1 = PC[5]
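
The doubleword mux select equations for muxes A through D can be transcribed into Python and checked against the requirement that every byte of a 20-byte packet is taken from the correct line buffer. The function names are hypothetical. The sketch assumes that Sel0 selects line buffer 0 and Sel1 selects line buffer 1, that mux A handles doubleword 0 of either buffer (mux B doubleword 1, and so on), and that addresses wrap modulo 64.

```python
def doubleword_selects(pc5, aor4, aor3):
    """Return {mux: (Sel0, Sel1)} for doubleword muxes A-D, per the equations."""
    p, a4, a3 = bool(pc5), bool(aor4), bool(aor3)
    sel0 = {
        "A": (not p and not a4 and not a3) or (p and a4) or (p and a3),
        "B": (not p and not a4) or (p and a4),
        "C": (p and a4 and a3) or (not p and not a4) or (not p and not a3),
        "D": not p,
    }
    sel1 = {
        "A": (p and not a4 and not a3) or (not p and a4) or (not p and a3),
        "B": (not p and a4) or (p and not a4),
        "C": (not p and a4 and a3) or (p and not a4) or (p and not a3),
        "D": p,
    }
    return {m: (int(sel0[m]), int(sel1[m])) for m in "ABCD"}


def covers_packet(start, length=20):
    """True if the selects deliver every byte of a packet starting at `start`."""
    pc5, aor4, aor3 = (start >> 5) & 1, (start >> 4) & 1, (start >> 3) & 1
    sel = doubleword_selects(pc5, aor4, aor3)
    for i in range(length):
        dw = ((start + i) % 64) // 8            # doubleword 0-7 across both buffers
        mux, line_buffer = "ABCD"[dw % 4], dw // 4
        if sel[mux][line_buffer] != 1:          # index 0 is Sel0, index 1 is Sel1
            return False
    return True


print(all(covers_packet(s) for s in range(64)))  # True
```

Because the optimized selects depend only on PC[5] and AOR[4:3], the don't care entries of the unoptimized table are resolved so that each mux always has exactly one select asserted.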






TABLE 4 



AOR           Mux E       Mux F       Mux G       Mux H
             0 1 2 3     0 1 2 3     0 1 2 3     0 1 2 3
00000   0    X X X X     X X X X     X X X X     X X X X
00001   1    1 0 0 0     X X X X     X X X X     X X X X
00010   2    1 0 0 0     1 0 0 0     X X X X     X X X X
00011   3    1 0 0 0     1 0 0 0     1 0 0 0     X X X X
00100   4    1 0 0 0     1 0 0 0     1 0 0 0     1 0 0 0
00101   5    X X X X     1 0 0 0     1 0 0 0     1 0 0 0
00110   6    X X X X     X X X X     1 0 0 0     1 0 0 0
00111   7    X X X X     X X X X     X X X X     1 0 0 0
01000   8    X X X X     X X X X     X X X X     X X X X
01001   9    0 1 0 0     X X X X     X X X X     X X X X
01010  10    0 1 0 0     0 1 0 0     X X X X     X X X X
01011  11    0 1 0 0     0 1 0 0     0 1 0 0     X X X X
01100  12    0 1 0 0     0 1 0 0     0 1 0 0     0 1 0 0
01101  13    X X X X     0 1 0 0     0 1 0 0     0 1 0 0
01110  14    X X X X     X X X X     0 1 0 0     0 1 0 0
01111  15    X X X X     X X X X     X X X X     0 1 0 0
10000  16    X X X X     X X X X     X X X X     X X X X
10001  17    0 0 1 0     X X X X     X X X X     X X X X
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TABLE 5 



AOR     Muxes A-D     Muxes E-H
        0 1 2 3       0 1 2 3
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TABLE 6



The logic equations for the mux selects of the 4:1 muxes are as follows:

Muxes A-D, Sel0 = (!AOR[4] && !AOR[3] && !AOR[2]) || (AOR[4] && AOR[3] && AOR[2])
           Sel1 = (!AOR[4] && !AOR[3] && AOR[2]) || (!AOR[4] && AOR[3] && !AOR[2])
           Sel2 = (!AOR[4] && AOR[3] && AOR[2]) || (AOR[4] && !AOR[3] && !AOR[2])
           Sel3 = (AOR[4] && !AOR[3] && AOR[2]) || (AOR[4] && AOR[3] && !AOR[2])

Muxes E-H, Sel0 = !AOR[4] && !AOR[3]
           Sel1 = !AOR[4] && AOR[3]
           Sel2 = AOR[4] && !AOR[3]
           Sel3 = AOR[4] && AOR[3]

The 8:1 mux control is much simpler as a result of the routing of the 4:1 mux outputs. 

This routing can be seen in Figure 9. The logic equations for the mux selects of the 8:1 muxes are as follows: 

Muxes 0-3, Sel0 = !AOR[2] && !AOR[1] && !AOR[0]
           Sel1 = !AOR[2] && !AOR[1] && AOR[0]
           Sel2 = !AOR[2] && AOR[1] && !AOR[0]
           Sel3 = !AOR[2] && AOR[1] && AOR[0]
           Sel4 = AOR[2] && !AOR[1] && !AOR[0]
           Sel5 = AOR[2] && !AOR[1] && AOR[0]
           Sel6 = AOR[2] && AOR[1] && !AOR[0]
           Sel7 = AOR[2] && AOR[1] && AOR[0]
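
The 4:1 and 8:1 select equations can be combined into a behavioral Python sketch of the two-level RAT operation. The function names are hypothetical. Two wiring details are assumptions made for illustration, since the referenced figures are not reproduced here: input k of 8:1 mux i is taken from 4:1 mux (k + i) mod 8, and each successive 4-byte output group sees the input bytes rotated by a further 4 positions so that the same selects can be shared.

```python
def sel_4to1(aor, mux):
    """One-hot Sel0..Sel3 for 4:1 mux `mux` (0-7 for A-H), per the equations."""
    a4, a3, a2 = (aor >> 4) & 1, (aor >> 3) & 1, (aor >> 2) & 1
    if mux < 4:  # muxes A-D
        sels = [
            (not a4 and not a3 and not a2) or (a4 and a3 and a2),
            (not a4 and not a3 and a2) or (not a4 and a3 and not a2),
            (not a4 and a3 and a2) or (a4 and not a3 and not a2),
            (a4 and not a3 and a2) or (a4 and a3 and not a2),
        ]
    else:        # muxes E-H
        sels = [not a4 and not a3, not a4 and a3, a4 and not a3, a4 and a3]
    return [int(bool(s)) for s in sels]


def rat(aligner_input, aor, length=20):
    """Extract `length` bytes starting at offset `aor` from 32 input bytes."""
    assert len(aligner_input) == 32 and 0 <= aor < 32
    out = []
    for group in range(length // 4):  # five 4-byte output groups
        # First level: 4:1 mux m sees bytes m, m+8, m+16, m+24 for group 0.
        # Assumption: later groups see the same pattern rotated by 4 bytes each.
        first = []
        for m in range(8):
            q = sel_4to1(aor, m).index(1)
            first.append(aligner_input[(m + 8 * q + 4 * group) % 32])
        # Second level: Sel k of every 8:1 mux is asserted when AOR[2:0] == k.
        # Assumption: input k of 8:1 mux i is wired to 4:1 mux (k + i) % 8.
        k = aor & 0b111
        out.extend(first[(k + i) % 8] for i in range(4))
    return bytes(out)


data = bytes(range(32))
print(list(rat(data, 29))[:6])  # [29, 30, 31, 0, 1, 2]
```

In this model every 4:1 mux of a given class (A-D or E-H) and every 8:1 mux receives the same select signals regardless of the output group, which is the sharing that the byte ordering is intended to enable.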
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WE CLAIM 

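1. An apparatus for an instruction fetch unit aligner of a microprocessor, comprising:
selection logic of the instruction aligner that extracts and aligns a non-power of two size instruction
from power of two size instruction data; and
control logic of the instruction aligner for controlling the selection logic.

2. The apparatus of Claim 1 wherein the selection logic comprises:
multiplexer logic for selecting the non-power of two size instruction from the power of two size
instruction data.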

3. ... instruction from the power of two size instruction data is performed within one clock cycle of the
microprocessor.

4. ... four 8:1 multiplexers for every 4 bits of the power of two size instruction data.

5. The apparatus of Claim 4 wherein the multiplexer logic further comprises: ...

6. The apparatus of Claim 4 further comprising a routing channel that reorders the bits output
from the multiplexer logic.

7. The apparatus of Claim 1 wherein the non-power of two size instruction comprises a Very
Long Instruction Word (VLIW) packet.

9. The apparatus of Claim 1 wherein the non-power of two size instruction comprises an
instruction packet that comprises a packet header, the packet header indicating a number of instructions in the
instruction packet.
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10. An apparatus for an instruction fetch unit aligner of a microprocessor, comprising:
selection logic of the instruction aligner that extracts and aligns a non-power of two size instruction
from power of two size instruction data; and
control logic of the instruction aligner for controlling the selection logic, wherein the control logic
comprises a decoder.

11. The apparatus of Claim 10 wherein the selection logic comprises:
multiplexer logic for selecting the non-power of two size instruction from the power of two size
instruction data.

12. The apparatus of Claim 11 wherein the extraction and alignment of the non-power of two size
instruction from the power of two size instruction data is performed within one clock cycle of the
microprocessor.

13. The apparatus of Claim 12 wherein the multiplexer logic comprises eight 4:1 multiplexers and
four 8:1 multiplexers for every 4 bits of the power of two size instruction data.

14. The apparatus of Claim 13 wherein the multiplexer logic further comprises:
four 2:1 multiplexers that each select 8 bytes to provide the power of two size instruction data.

15. The apparatus of Claim 13 further comprising a reorder channel that reorders the bits output
from the multiplexer logic.

16. The apparatus of Claim 10 wherein the non-power of two size instruction comprises a Very
Long Instruction Word (VLIW) packet.

17. The apparatus of Claim 10 wherein the control logic is optimized.

18. The apparatus of Claim 10 wherein the power of two size instruction data comprises 32 bytes
of instruction data.

19. The apparatus of Claim 10 wherein the instruction aligner is implemented as a megacell.

20. The apparatus of Claim 10 wherein the non-power of two size instruction comprises an
instruction packet that comprises a packet header, the packet header indicating a number of instructions in the
instruction packet.
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